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Abstract 

We discuss a program suite for simulating Quantum Chromodynamics on a 4- 
dimensional space-time lattice. The basic Hybrid Monte Carlo algorithm is intro- 
duced and a number of algorithmic improvements are explained. We then discuss 
the implementations of these concepts as well as our parallelisation strategy in the 
actual simulation code. Finally, we provide a user guide to compile and run the 
program. 
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LONG WRITE-UP 



1 Introduction 

This contribution to the anniversary issue of CPC deals with the strong force 
in Particle Physics. The strong force is presumably the least well understood 
fundamental interaction between elementary particles. It is responsible for the 
existence of protons and neutrons, or more generally all nuclei, as bound states. 
The constituents of the nuclei are the quarks and gluons as the fundamental 
particles. It is interesting to observe that the energy (mass) of a proton has a 
size of about IGeV while the mass of the two constituent up and down quarks 
is at the order of only a few MeV. Hence, the by far biggest contribution to 
the proton mass is pure binding energy. 

This shows already that a description of the proton in terms of the underlying 
quark and gluon degrees of freedom must be highly non-trivial. The model that 
is believed to provide a theoretical framework for the strong interaction and 
should give such a description is Quantum Chromo dynamics (QCD). Although 
this theory can be written in a very compact mathematical form, it is a highly 
non-linear theory that does not allow for a closed analytical solution. 

However, and rather fortunately, QCD can be reformulated in such a way 
that computational physics methods can be applied to calculate properties 
of QCD from first principles and without relying on approximations. In this 
approach, space and time are rendered discrete and a lattice spacing a is 
introduced. Thus, a 4-dimensional space-time lattice is considered and the 
quark and gluon degrees of freedom are placed on the lattice points or on 
so-called links that connect lattice points. In this way one obtains a model 
of "quark" spins, which are coupled to nearest neighbours only, very much 
reminiscent of an Ising model in statistical physics. Indeed, the methods of 
statistical physics, namely the evaluation of the partition function by means 
of numerical simulations using Monte Carlo methods employing importance 
sampling, are the key to address QCD on the 4-dimensional lattice, which we 
will refer to as lattice QCD (LQCD). 

Although the concepts to treat LQCD numerically are very clear, the problem 
has an intrinsically extremely high computational demand. The crucial factor 
is that in the end the introduced discretisation has to be removed again. If 
we consider a lattice with say, L = 3 fm linear extent and a lattice spacing of 
a = 0.1 fm then we would have to use iV = 3/0.1 = 30 lattice points in one 
direction. Since we are dealing with a 4-dimensional problem, we need hence 
30"^ lattice points for such a more or less reasonable physical situation. Simu- 
lations on such a lattice require in LQCD already several Teraflops. Keeping 
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Fig. 1. Computer resources needed to generate 1000 independent configurations of 
size 24^ X 40 at a lattice spacing of about 0.08 fm with pure Wilson fermions in 
units of Tflops • years as a function of mps/my. In (a) we show the prediction of 
Ref. [Ij from the 2001 lattice conference in Berlin (hence titled "Berlin Wall"). In 
(b) we compare to the formula of Ref. [1] (solid line) with the performance of the 
algorithm described in this paper and first published in Ref. [2]. The dashed line is 
to guide the eye. The scale of the vertical axis changes by about a factor of 1/50 
from (a) to (b). In (a) and (b) the arrow indicates the physical pion to rho meson 
mass ratio. Note that all the cost data were scaled to match a lattice time extend 
of T = 40. 

L = 3 fm fixed and decreasing the lattice spacing to remove the discretisation 
by, say, a factor of two increases the cost of the simulation already by a factor 
2^. As this would not be worse enough, the used algorithms contribute another 
factor of 2'^^"^). Hence, going to really fine discretisations where the effects of 
the non-zero lattice spacing can safely be neglected or at least a controlled 
extrapolation to zero values of the lattice spacing can be performed is an ex- 
tremely demanding computational challenge which will finally require at least 
Petaflops computing, an area of computing power we are realising today. 

However, even with the advent of Petaflops computers, the goal of "solving" 
QCD on a lattice would be completely out of reach without some algorithmic 
improvements that were invented in recent years. This is shown in fig. [H In 
the left panel of the graph, the number of Terafiops years for a certain typical 
simulation is shown as a function of the ratio of two meson masses, the pseudo 
scalar and the vector meson. The graph derives from the known cost of the 
used algorithm in the year 2001 p^. 

It is important to realise that the meson mass ratio used in fig. [1] assumes a 
value of about 0.2 in nature as measured in experiments, which is indicated 
by the arrow in both panels of fig. [D The figure clearly demonstrates the 
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strongly growing cost of the simulations when the real physical situation is 
to be reached. In fact, simulations directly at the physical value of this mass 
ratio were impossible in 2001. The right panel of the graph demonstrates the 
change of the situation when algorithmic improvements are used as the ones 
described in this article. In fact, the simulation costs shown in the right panel 
of fig. [1] originate from direct performance measurements of the code that is 
described here. As the figure demonstrates, although the simulations at the 
physical value of the meson mass ratio are still rather demanding, they become 
clearly realistic with todays Petafiops systems. There are other approaches to 
improve the HMC algorithm with similar results PIUISIISIITIIS] . Very promising is 
the recent additional improvement using inexact deflation presented in Ref. [9j. 

The algorithmic improvements provided therefore a tremendous gain opening 
a way for simulations in LQCD that were unthinkable a few years ago. It 
is precisely the goal of this contribution to describe one programme version 
of the underlying Hybrid Monte Carlo (HMC) algorithm where a number of 
such improvements have been incorporated and to make the corresponding 
code publicly available. The current version of the code and future updates 
can be downloaded from the web ^lOj . 



2 Theoretical background 



2.1 QCD on a lattice 



Quantum Chromo dynamics on a hyper-cubic Euclidean space-time lattice of 
size X T with lattice spacing a is formally described by the action 



with Sq some suitable discretisation of the the Yang-Mills action F^^^/A [llj . 
The particular implementation we are using can be found below in section 4.2 
and consists of plaquette and rectangular shaped Wilson loops with particu- 
lar coefficients. D is a discretisation of the Dirac operator, for which Wilson 
originally proposed p!2] to use the so called Wilson Dirac operator 



S = SG[U]+a^^tlj D[U]^P 



(1) 



X 
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with and V* the forward and backward gauge covariant difference opera- 
tors, respectively: 



a 



U{x,x + afi)4'{x + afi) — ip{x 
ip{x) — U\x, X — afi)il){x — afj) 



(3) 



where we denote the SU(3) hnk variables by Ux,^- We shall set a = 1 in 
the following for convenience. Discretising the theory is by far not a unique 
procedure. Instead of Wilson's original formulation one may equally well chose 
the Wilson twisted mass formulation and the corresponding Dirac operator 



{Dw[U] + mo) 1/ + ifig-f5'r^ 



(4) 



for a mass degenerate doublet of quarks. We denote by mo the bare (Wilson) 
quark mass, Hq is the bare twisted mass parameter, r* the i-th Pauli matrix and 
1/ the unit matrix acting in flavour space (see appendix |X] for our convention). 
In the framework of Wilson twisted mass QCD only flavour doublets of quarks 
can be simulated, however, the two quarks do not need to be degenerate in 
mass. The corresponding mass non-degenerate flavour doublet reads [14] 



(5) 



Note that this notation is not unique. Equivalently - as used in Ref. [15j - one 
may write 

D'f^dl^, Us) = ■ If + i'j5flaT^ + flsT^ , (6) 

which is related to Dh by D'f^ = {l + iT'^)Dh{l — 'ir^)/2 and (/io-, fis) {fi, — e)- 



2.2 The Hybrid Monte Carlo Algorithm 



For the purpose of introducing the Hybrid Monte Carlo (HMC) algorithm 
we shall consider only the Wilson twisted mass formulation of lattice QCD 
with oneq doublet of mass degenerate quarks with bare quark mass mo and 
bare twisted mass fig. The extension to more than one flavour doublet of 
quarks is straightforward. The corresponding polynomial HMC algorithm used 
for simulating the mass non-degenerate flavour doublet is discussed in the 
following sub-section. 

After integrating out the Grassmann valued fermion fields, in lattice QCD one 
needs to evaluate the integral 

I VU det(QtQ) e-^G ^ (7) 
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by Markov Chain Monte Carlo methods with some discretisation of the Yang- 
Mills gauge action Sq and 

Q = j^Dy^P] + 'j^rrio + ifXg , (8) 

with the Wilson-Dirac operator Dy^ of eq. ([2]). Note that Q acts now on one 
flavour only. The determinant can be re-expressed using complex valued, so- 
called pseudo fermion fields and 0^ 

det(Q2) oc JV(j) V(f)^ e-'^Q-'<t>,Q-''t^) (g) 

where S-p-p = \Q^^(f)\'^ is called the pseudo fermion action. The HMC algo- 
rithm [in] is then defined by introducing traceless hermitian momenta Px^^ 
(conjugate to the fundamental fields t/x.^t) and a Hamiltonian 

m, p) = i: ^t^apL] + sg[u] + Spp[u] . (10) 

Given 7Y, the algorithm is composed out of a molecular dynamics update of 
the fields {U,P) {U',P') and a Metropolis accept/reject step with respect 
to H using the acceptance probability 

Pace = min(l, exp {n{U', P') - n{U, P)) . (11) 

While the momenta P are generated at the beginning of a trajectory - in the 
so called heat-bath step - randomly from a Gaussian distribution, the pseudo 
fermion fields (p are generated by first generating random fields r and then 

(f) = Qr 

such that exp{ — {Q^^(j),Q~^(j))} = exp{rV}. Note that the pseudo fermion 
fields are not evolved during the molecular dynamics part of the HMC algo- 
rithm. 



2.2.1 Molecular Dynamics Update 

In the molecular dynamics (MD) part of the HMC algorithm the momenta 
and gauge fields are updated corresponding to the Hamiltonian equations of 
motion 

* (12) 



dr 



Ux,^ Px,fiUx,fi 



with respect to a fictitious computer time r and forces F which are obtained 
by differentiating the action with respect to the gauge fields U, and takes 
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values in the Lie algebra of SU(3). The differentiation Du of some function 
f{U) is defined as 

where t° are the generators of su(3). 

Since these equations can in general not be integrated analytically, one uses 
numerical integration methods, which must be area preserving and reversible. 
Symmetrised symplectic integrators fulfil these requirements with the simplest 
example being the leap-frog algorithm. The basic discrete update steps with 
integration step size Ar of the gauge field and the momenta can be defined as 

Tu(Ar): U ^ f/' = exp ArP) f/ , 

Ts(Ar) : P P' = P-iATF. ^ ^ 

The leap-frog algorithm is then obtained by sequential application of 

T = Ts(Ar/2)Tu(Ar)Tg(Ar/2), 

i.e. for a trajectory of length r one needs to apply T^^^^ with A^md = t/At. 



2.2.2 Preconditioning and Multiple Time Scales 
Preconditioning is usually performed by factorising 

det(g^Q) = det{R\Ri) ■ det(P^i?2) • • • det(i?J,i?„) (14) 

with suitably chosen Ri,R2,...Rn- Then For every Ri a separate pseudo 
fermion field 0j is introduced, such that the Hamiltonian reads 

1 " 

nu, p) = E T^npL] + sg[u] + E ^PF,: • (15) 

and the equations of motion are changed to 

d " 

d 

where we identify Fq with the force stemming from the gauge action Sq. 



The factorisation in eq. flMj) can be achieved in many different ways, see for 
instance Refs. [3|Hl5|6|71l8] . Here we shall only discuss what is known as mass 
preconditioning or Hasenbusch trick [T7|18|19] . It is obtained by writing the 



identity 



,M(Q^Q)^.MiWHV).^g§-^, (16) 
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where 

W = Dw + mo + 2/^275, fJ'2> fig- 

By adjusting the the additional mass parameter /i2, the condition number of 
W^'W and {Q''Q)/iW^W) can both be reduced with respect to the one of Q'^Q 
alone. As argued in Ref. [2^, the optimal choice leads to a condition number 
of ^/k for both W^W and {Q^Q)/{W^W), where k is the condition number of 
Q'^Q. A reduced condition number leads to reduced force contributions in the 
MD evolution and allows hence for larger values of At. 

It is important to notice that evaluating the force contribution stemming from 
(Q^Q) /(W^W) is more expensive in terms of computer time than the evalua- 
tion of the contribution from W^W, since it involves the iterative solution of 
(p = {Q^Q)~^4' with the large condition number k. Thus, the algorithm might 
be further improved by not tuning the condition numbers equal as explained 
beforehand, but by introducing a multiple time scale integration scheme as 
follows. 

Considering a Hamiltonian like in eq. ( |T5l) we may introduce n + 1 timescales 
Axj with 

At, = — Nmb, = Nn ■ Nn-i ■■■N, 
JVmd, 

and basic discrete update steps 

Tu(Ar): U ^ U' = exp (zAtP) U , 

TsXAt): P P' = P-tATFi ^ ^ 

with < i < n. We have identified Sq = Sq. The leap frog update on timescale 
i is then recursively defined as 



T,; 



Ts, {ATi/2)Tv{ATi) Ts, {An/2) i = 
Ts,(Ar,/2) [r,_i]^>- rs,(Ar,/2) < z < n 



and the full trajectory of length r is eventually achieved by [T„ 



Nn 



As was shown in Ref. [2] - and for other factorisations of the determinant 
in Refs. [3]l7f2T] - the combination of multiple time scale integration and a 
determinant factorisation allows to set the algorithm up such that the most 
expensive operator contributes least to the MD forces. It can then be inte- 
grated on the outermost timescale and must be less often inverted. 



2.2.3 Integration Schemes 

During the last paragraphs we have introduced the simplest reversible and 
area preserving integration scheme, known as leap frog integration scheme. 
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There are more involved integration schemes available, partly or completely 
cancelling higher order discretisation errors. 

It turns out that completely cancelling higher order effects is not necessary 
and often even not efficient. Integration schemes with reduced errors are for 
example the so called n-th order minimal norm integration schemes, for details 
see Ref. [22] and references therein. The second order minimal norm (2MN) 
integration scheme is based on the update step 

Tso(AoAro) Tu(Aro/2) Ts,((l - 2Ao)Aro) 
Tu(Aro/2) Ts„(AoAro), ^^^^ 
Ts,(A,Ar,) [7;2MN]A.._. Ts,((1 - 2A,)Ar,) 

Aj is a dimensionless parameter and the 2MN scheme coincides with the 
Sexton- Weingarten scheme [22] in case Aj = 1/6. The optimal value for Aj 
was given in Ref. [22] to be around 0.19. But its value is likely to depend on 
the mass values and the time scale under consideration. Note that there is a 
parameter Aj for each timescale Atj, which can be tuned separately. 

While all the integration schemes introduced so far were based on the order 
Ts Tu Ts, it is also possible to revert this order. In this case one talks about 
the position version of the corresponding integration scheme, while the usual 
one is called the velocity version. Under certain circumstances they can be 
more efficient, because one less application of Ts is needed. The corresponding 
update steps can be easily derived from the formulae provided above. 

2.3 Polynomial HMC for a non-degenerate doublet 

In the framework of Wilson twisted mass fermions it is only possible to sim- 
ulate flavour doublets of quarks. Hence, if one wants to include the strange 
quark in the simulation one also needs to include the charm. The correspond- 
ing mass non-degenerate doublet was defined in equation Simulating such 
a flavour doublet operator is possible using the polynomial HMC (PHMC) al- 
gorithm [2H25f26j . The basic problem that occurs in the mass non-degenerate 
case is that a single flavour has to be taken into account or equivalently the 
determinant of a single operator Q needs to be treated. The PHMC algorithm 
can solve this problem elegantly. 

The idea of the PHMC is based on writing 

det(g) = det(^) ^ det{p-^{Q^)) oc J V(f) Vcf)^ e-<^^^^ 



^0 



2MN 



T 



2MN 
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valid as long as Q is positive. Pe,n(Q^) is a polynomial approximation of 
of degree n in the interval [e, 1] 

P„,,(s) = ^{l + i?„,,}, s = Q\ (20) 



Rn,e is the error term. It can be shown that for the case of Chebysheff polyno- 
mials \R\ vanishes exponentially fast with the degree n (for large n). For more 
details regarding this issue we refer the reader for instance to Refs. [27f28] and 
references therein. 

It is worth noticing that representing in inverse operator by a polynomial has 
conceptual advantages. It allows to treat certain regions of the eigenvalue spec- 
trum of the operator in different ways and to separate therefore the infrared 
from the bulk and ultraviolet parts of the spectrum. Although this has been 
the main underlying idea of the PHMC algorithm [2^25f26j we will use it here, 
however, only as a technical tool to treat single flavours in the simulations. 

For our purpose - introducing Qh = - we can rewrite the corresponding 

determinant 

det(g;,)oc J V<^ e-*tp„,.(s)$ ^ 

with s = QhQh and the pseudo fermion fields $ are now two flavour fields. 
Note that Dj^ = T^j5Dh'j5T^ ■ The application of the polynomial P to a pseudo 
fermion field $ can be performed by either using the Clenshaw recursion re- 
lation [29], or by using the product representation 



.1=1 



with Zi the complex roots of P and a suitably chosen normalisation constant 
c. The product representation is conveniently used in the MD update. For the 
choice of polynomials, the determination of their roots and how to order them 
to avoid round-off errors see appendix O 

The HMC algorithm requires an area preserving and reversible MD update 
procedure, however, there is no need to use in the MD update the same op- 
erator as in the heat-bath step. As long as the acceptance rate is sufficiently 
high, we are free to use any other operator in the update. In order to exploit 
this possibility we introduce a second more precise polynomial 

PmAs) = --^{l + Rm,s} (21) 
-< n,eV S 

which is used in the heat-bath step to generate the pseudo fermion fields from 
a random field R 

$ = PB^QhR 
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and in the acceptance step. The less precise polynomial P is then used only 
in the MD update. 

The polynomial degrees n, m and the approximation intervals have to be de- 
termined such as to guarantee a good approximation of 1/ in the range of 
eigenvalues of Q\Qh- One may also adopt a strategy to chose e or 5 larger 
than a few lowest eigenvalues of QhQh and use re-weighting to correct for 
this [21125] . 

Even/Odd preconditioning 

The (P)HMC algorithm is implemented using even/odd preconditioning [30f3T] 
which is discussed shortly in appendix [Bl We want to stress that although 
even/odd preconditioning is a rather technical step, it leads to a very impor- 
tant improvement of the algorithm performance and is a cornerstone of all 
HMC implementations in the field. 

2.4 Boundary Conditions 

The theory is discretised and put on a finite, hyper-cubic space-time lattice 
with extensions xT = Yi/iL^. The boundary conditions for the gauge fields 
Ux,^ are chosen to be periodic, i.e. 

where z> is a unit vector in direction v. For the fermionic fields ip{x) we allow 
for more general boundary conditions, namely so called twisted boundary 
conditions 

Periodic boundary conditions correspond to 6^ = 0, while anti-periodic bound- 
ary conditions are achieved by setting 6^ = 1. More generally one can realise 
with twisted boundary conditions arbitrary values of momentum transfer on 
the lattice by a convenient re-interpretation of the phases [32] . 



3 Overview of the software structure 

The general strategy of the tmLQCD package is to provide programs for the 
main applications used in lattice QCD with Wilson twisted mass fermions. 
The code and the algorithms are designed to be general enough such as to 
compile and run efficiently on any modern computer architecture. This is 
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Read Input From File 



Initialise Programme / Read Data from Disl< 



Do No. of Trajectories times 





iviolecular Dynamics Update 
Online Measurements 
Write Data to Disk 







Finalise 



Fig. 2. Flowchart for the hmc_tm executable 



achieved code-wise by using standard C as programming language and for 
parallelisation the message passing interface (MPI) standard version 1.1. 

Performance improvements are achieved by providing dedicated code for cer- 
tain widely used architectures, like PC's or the Blue Gene family. Dedicated 
code is mainly available for the kernel routine - the application of the Dirac 
operator, which will be discussed in detail in section 14. 1[ and for the commu- 
nication routines. 

The tmLQCD package provides three main applications. The first is an imple- 
mentation of the (P)HMC algorithm, the second and the third are executables 
to invert the Wilson twisted mass Dirac operator (jl]) and the non-degenerate 
Wilson twisted mass Dirac operator respectively. All three do have a wide 
range of run-time options, which can be influenced using an input file. The 
syntax of the input file is explained in the documentation which ships with the 
source code. The relevant input parameters will be mentioned in the following 
where appropriate, to ease usage. 

We shall firstly discuss the general layout of the three aforementioned appli- 
cations, followed by a general discussion of the parallelisation strategy used in 
all three of them. 
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3.1 hmc_tm 



In figure [2] the programme flow of tlie hmc_tm executable is depicted. In the 
first block the input file is parsed and parameters are set accordingly. Then 
the required memory is allocated and, depending on the input parameters, 
data is read from disk in order to continue a previous run. 

The main part of this application is the molecular dynamics update. For a 
number of trajectories, which must be specified in the input file, first a heat- 
bath is performed, then the integration according to the equations of motion 
using the integrator as specified in the input file, and finally the acceptance 
step. 

After each trajectory certain online measurements are performed, such as mea- 
suring the plaquette value. Other online measurements are optional, like mea- 
suring the pseudo scalar correlation function. 

3.1.1 command line arguments 

The programme offers command line options as follows: 

• -h I ? prints a help message and exits. 

• -f input file name. The default is hmc . input 

• -o the prefix of the output filenames. The default is output. The code will 
generate or append to two files, output .data and output .para. 

3.1.2 Input / Output 

The parameters of each run are read from an input file with default name 
hmc. input. If it is missing all parameters will be set to their default values. 
Any parameter not set in the input file will also be set to its default value. 

During the run the hmc_tm program will generate two output files, one called 
per default output .data, the other one output, para. Into the latter impor- 
tant parameters will be written at the beginning of the run. 

The file output. data has several columns with the following meanings 

(1) Plaquette value. 

(2) Ai7 

(3) exp(-Ai7) 

(4) a pair of integers for each pseudo fermion monomial. The first integer of 
each pair is the sum of solver iterations needed in the acceptance and 
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Fig. 3. Flowchart for the main part of the invert and invert_doublet executables. 

heatbath steps, the second is the sum of iterations needed for the force 
computation of the whole trajectory. 

(5) Acceptance (0 or 1). 

(6) Time in seconds needed for this trajectory. 

(7) Value of the rectangle part in the gauge action, if used. 

Every new run will append its numbers to an already existing file. 

In addition, the program will create a file history_hmc_tni. This file provides 
a mapping between the configuration number and its plaquette and Polyakov 
loop values. Moreover the simulation parameters are stored there and in case 
of a reread the time point can be found there. 

After every trajectory the program will save the current configuration in the 
file conf . save. 

3.2 invert and invert_doublet 

The two applications invert and invert_doublet are very similar. The main 
difference is that in invert the one flavour Wilson twisted mass Dirac op- 
erator is inverted, whereas in invert_doublet the non-degenerate doublet is 
inverted. 

The main part of the two executables is depicted in figure [31 Each measurement 
corresponds to one gauge configuration that is read from disk into memory. For 
each of these gauge configurations a number of inversions will be performed. 
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The sources can be either generated or read in from disk. In the former case 
the programme can currently generate point sources at random location in 
space time. In the latter case the name of the source file can be specified in 
the input file. 

The relevant Dirac operator is then inverted on each source and the result is 
stored on disk. The inversion can be performed with a number of inversion 
algorithms, such as conjugate gradient (CG), BiCGstab, and others |33]. And 
optionally even/odd preconditioning as described previously can be used. 

3.2.1 command line arguments 

The two programmes offer command line options as follows: 

• -h I ? prints a help message and exits. 

• -f input file name. The default is hmc . input 

• -o the prefix of the output filenames. The default is output. The code will 
generate or append to one file called output . para. 

3.2.2 Output 

The program will create a file called output . data with information about 
the parameters of the run. Of course, also the propagators are stored on disk. 
The corresponding file names can be influenced via input parameters. The file 
format is discussed in some detail in sub-section 14. 7[ 

One particularity of the invert_doublet program is that the propagators 
written to disk correspond to the two flavour Dirac operator of eq. i.e. 

essentially for compatibility reasons. For the two flavour components written 
the first is the would be strange component and the second one the would be 
charm one. 



3. 3 Parallelisation 

The whole lattice can be parallelised in up to 4 space-time directions. It is 
controlled with configure switches, see section 15. 2[ The Message Passing In- 
terface (MPI, standard version 1.1) is used to implement the parallelisation. 
So for compiling the parallel executables a working MPI implementation is 
needed. 
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Depending on the number of parallelised space-time directions the t-direction, 
the t- and x-direction, the t-, x- and ?/-direction or the t-, x- and y- and 
2;-direction are parallelised. 

The number of processors per space direction must be specified at run time, 
i.e. in the input file. The relevant parameters are NrXProcs, NrYProcs and 
NrZProcs. The number of processors in time direction is determined by the 
program automatically. Note that the extension in any direction must divide 
by the number of processors in this direction. 

In case of even/odd preconditioning further constraints have to be fulfilled: 
the local and the local product Lt x x Ly must both be even. 




Fig. 4. Boundary exchange in a two dimensional parallel setup. One can see that the 
internal boundary is send while the external one is received. The corners - needed for 
implementing improved gauge actions like the tree-level Symanzik improved gauge 
action ^34j - need a two step procedure. 

The communication is organised using boundary buffer, as sketched in figure HI 
The MPI setup is contained in the file mpi_init . c. The corresponding function 
must be called at the beginning of a main program just after the parameters 
are read in, also in case of a serial run. In this function also the various 
MPI_Datatypes are constructed needed for the exchange of the boundary fields. 
The routines performing the communication for the various data types are 
located in files starting with xchange_. 

The communication is implemented using different types of MPI functions. 
One implementation uses the MPI_Sendrecv function to communicate the 
data. A second one uses non-blocking MPI functions and a third one persistent 
MPI calls. See the MPI standard for details • On machines with network ca- 
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pable of sending in several directions in parallel the non-blocking version is the 
most efficient one. The relevant configure switches are — with-nonblockingmpi 
and — with-persistentmpi, the latter of which is only available for the Dirac 
operator with halfspinor fields, see section 14.11 



4 Description of the individual software components 



4-1 Dirac Operator 



The Dirac operator is the kernel routine of any lattice QCD application, be- 
cause its inverse is needed for the HMC update procedure and also for com- 
puting correlation functions. The inversion is usually performed by means of 
iterative solvers, hke the conjugate gradient algorithm, and hence the repeated 
application of the Dirac operator to a spinor field is needed. Thus the optimi- 
sation of this routine deserves special attention. 

At some space-time point x the application of a Wilson type Dirac operator 
is mainly given by 



(x) =(mo + 4r + 2/1575)?/; (x) 



1 ^ r 



[22] 



where r is the Wilson parameter, which we set to one in the following. The 
most computer time consuming part is the nearest neighbour interaction part. 

For this part it is useful to observe that 

has only two independent spinor components, the other two follow trivially. 
So only two of the components need to be computed, then to be multiplied 
with the corresponding gauge field [/, and then the other two components are 
to be reconstructed. 

The operation in eq. fl2^ must be performed for each space-time point x. If the 
loop over x is performed such that all elements of are accessed sequentially 
(one output stream), it is clear that the elements in -0 and U cannot be 
accessed sequentially as well. This non-sequential access may lead to serious 
performance degradations due to too many cache misses, because modern 
processing units have only a very limited number of input streams available. 

While the ip field is usually different from one to the next application of 



18 



the Dirac operator, the gauge field stays often the same for a large number 
of applications. This is for instance so in iterative solvers, where the Dirac 
operator is applied 0(1000) times with fixed gauge fields. Therefore it is useful 
to construct a double copy of the original gauge field sorted such that the 
elements are accessed exactly in the order needed in the Dirac operator. For 
the price of additional memory, with this simple change one can obtain large 
performance improvements, depending on the architecture. The double copy 
must be updated whenever the gauge field change. This feature is available in 
the code at configure time, the relevant switch is — with-gaugecopy. 

Above we were assuming that we run sequentially through the resulting spinor 
field (f). Another possibility is to run sequentially through the source spinor field 
ifj. Moreover, one could split up the operation (122|) following the standard trick 
of introducing intermediate result vectors with only two spinor components 
per lattice site. Concentrating on the hopping part only, we would have 

= Pi^2 (r - 7^)^(a;) . 
From we can then reconstruct the resulting spinor field as 

= Pl~il^^^{x + a/i, /i) 

Here we denote with P^^^ the projection to the two independent spinor com- 
ponents for 1 ± 7^ and with P^~^^ the corresponding reconstruction. The half 
spinor fields ip^ can be interlaced in memory such that iIj{x) as well as ^p^{x) 
are always accessed sequentially in memory. The same is possible for the gauge 
fields, as explained above. So only for we cannot avoid strided access. So far 
we have only introduced extra fields ip^, which need to be loaded and stored 
from and to main memory, and divided the Dirac operator into two steps 
(1231) and (1241) which are very balanced with regard to memory bandwidth and 
fioating point operations. 

The advantage of this implementation of the Dirac operator comes in the 
parallel case. In step fl23l) we need only elements of ip{x), which are locally 
available on each node. So this step can be performed without any commu- 
nication. In between step fl25]) and fl2^ one then needs to communicate part 
of (p^, however only half the amount is needed compared to a communication 
of ip. After the second step there is then no further communication needed. 
Hence, one can reduce the amount of data to be sent by a factor of two. 

There is yet another performance improvement possible with this form of the 
Dirac operator, this time for the price of precision. One can store the interme- 
diate fields with reduced precision, e.g. in single precision when the regular 
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(24) 



spinor fields are in double precision. This will lead to a result with reduced 
precision, however, in a situation where this is not important, as for instance in 
the MD update procedure, it reduces the data to be communicated by another 
factor of two. And the required memory bandwidth is reduced as well. This 
version of the hopping matrix (currently it is only implemented for the hopping 
matrix) is available at configure time with the switch — enable-half spinor. 

The reduced precision version (sloppy precision) is available through the input 
parameter UseSloppyPrecision. It will be used in the MD update where 
appropriate. Moreover, it is implemented in the CG iterative solver following 
the ideas outlined in Ref. [36] for the overlap operator. 

The various implementations of the Dirac operator can be found in the file 
D_psi . c and - as needed for even/odd preconditioning - the hopping matrix 
in the file Hopping_Matrix. c. There are many different versions of these two 
routines available, each optimised for a particular architecture, e.g. for the 
Blue Gene/P double hummer processor or the streaming SIMD extensions of 
modern PC processors (SSE2 and SSE3), see also Ref. [37]. Martin Liischer 
has made available his standard C and SSE/SSE2 Dirac operator [38] under 
the GNU General Public License, which are partly included into the tmLQCD 
package. 



4.1.1 Blue Gene Version 

The IBM PowerPC 450d processor used on the Blue Gene architecture pro- 
vides a dual FPU, which supports a set of SIMD operations working on 32 
special registers useful for lattice QCD. These operations can be accessed us- 
ing build in functions of the IBM XLC compiler. The file bgl . h contains all 
macros relevant for the Blue Gene version of the hopping matrix and the Dirac 
operator. 

A small fraction of half spinor version (see above) is given in algorithm [H 
which represents the operation ip~^ = kU P^^^(1 + 'jo)4'. After loading the 
components of ip into the special registers and prefetching the gauge field for 
the next direction (in this case 1 + 7i), P+^^(l + 7o)V' is performed. It is then 
important to load the gauge field U only once from memory to registers and 
multiply both spinor components in parallel. 

Finally the result is multiplied with k, (which inherits also a phase factor due 
to the way we implement the boundary conditions, see next sub-section) and 
stored in memory. 
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Algorithm !(/?+ = k t/P^^^j^l + 70)^/' 

1: // load components of ip into registers 

2: _bgUoad_rsO((*s).sO); 

3: _bgUoad_rsl((*s).sl); 

4: _bgUoad_rs2((*s).s2); 

5: _bgLload_rs3((*s).s3); 

6: // prefetch gauge field for next direction (1 + 71) 

7: _prefetch_su3(U+l); 

8: // do now first + 7o)^ 

9: _bgLvector_add_rs2_to_rsO_regO() ; 

1 : -bgLvect or _add_r s3 _to_rs 1 _reg 1 ( ) ; 

11: / /now multiply both components at once with gauge field U and k 

12 : _bgLsu3 _mult iply_double ( ( * U) ) ; 

13: _bgl_vector_cmplx_mul_double(kaO) ; 

14: / / store the result 

15: _bgl_store_regO_up( (*phi [ix] ) .sO) ; 

16: _bgl_store_regl_up( (*phi [ix] ) .si) ; 



4.1.2 Boundary Conditions 



As discussed previously, we allow for arbitrary phase factors in the boundary 
conditions of the fermion fields. This is conveniently implemented in the Dirac 
operator phase factor in the hopping term 

The relevant input parameters are ThetaT, ThetaX, ThetaY, ThetaZ. 



4.2 The HMC Update 



We assume in the following that the action to be simulated can be written as 

-^mo 110 m i al s 

S = Sq + ^ Spp- , 

i=l 

and we call - following the CHROMA notation [39] - each term in this sum a 
monomial. We require that there is exactly one gauge monomial Sq (which we 
identify with Sq in the following) and an arbitrary number of pseudo fermion 
monomials Spp^. 

As a data type every monomial must known how to compute its contribution 
to the initial Hamiltonian Ti at the beginning of each trajectory in the heat- 
bath step. Then it must know how to compute the derivative with respect to 
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Monomial 




Type 




pseudo fermion field 



heat bath function 



derivative function 



acceptance function 



timescale 



Fig. 5. Data type monomial and its components 

the gauge fields for given gauge field and pseudo fermion field needed for the 
MD update. And finally there must be a function to compute its contribution 
to the final Hamiltonian Ti' as used in the acceptance step. 

In addition for each monomial it needs to be known on which timescale it 
should be integrated. The corresponding data type is sketched in figure [5l 
The general definitions for this data type can be found in the file monomial . c. 

There are several sorts of monomials implemented: 

• DET: pseudo fermion representation of the (mass degenerate) simple deter- 
minant 



DETRATIO: pseudo fermion representation of the determinant ratio 
NDPOLY: polynomial representation of the (possibly non-degenerate) doublet 



[det(Q„d(e,/i)')] 



2^,11/2 



GAUGE: 



-E 

3 ^ 



Co E {1 - ^^T^^Ull]^)} + E {1 - ReTr(t/2f J} 



J 



The parameter Ci can be set in the input file and Cq = 1 — 8ci. Note that 
Ci = corresponds to the Wilson plaquette gauge action. 

The corresponding specific functions are defined in the files det_monomial . c, 
detratio_monomial . c, ndpoly_monomial . c and gauge_monomial . c. Additional 
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Algorithm 2 integrate 
Require: < rits < A^ts, t > 

1: At = r/noSteps[?2ts] 

2: for ? = to noSteps[nts] do 

3: if fits == 1 then 

4: updateGauge(Ar) 

5: else 

6: integrate(nts — 1, Ar) 
7: end if 

8: updateMomenta(Ar, monomialList[r;,ts]) 
9: end for 

monomials can easily be implemented by providing the corresponding func- 
tions as discussed above. 

The integration scheme is implemented recursively, as exemplified in algo- 
rithm [2] for the leap-frog integration scheme (where we skipped half steps for 
simplicity). The updateMomenta function simply calls the derivative func- 
tions of all monomials that are integrated on timescale rits and updates the 
momenta P according to the time step Ar. 

The recursive scheme for the integration can easily be extended to more in- 
volved integration schemes. The details can be found in the file integrator . c. 
We have implemented the leap-frog and the second order minimal norm [22] 
integrations schemes. They are named in the input file as LEAPFROG and 2MN, 
respectively. These two can be mixed on different timescales. In addition we 
have implemented a position version of the second order minimal norm inte- 
gration scheme, denoted by 2MNP0SITI0N in the input file. The latter must 
not be mixed with the former two. 

The MD update is summarised in algorithm [31 It computes the initial and 
final Hamiltonians and calls in between the integration function with the total 
number of timescales A^ts and the total trajectory length r. 

4-2.1 Reduced Precision in the MD Update 

As shortly discussed previously, as long as the integration in the MD update is 
reversible and area preserving there is large freedom in choosing the integration 
scheme, but also the operator: it is not necessary to use the Dirac operator 
here, it can be any approximation to it. This is only useful if the acceptance 
rate is not strongly affected by such an approximation. 

The code provides two possibilities to adapt the precision of the Dirac op- 
erator used in the MD update: the first is to reduce the precision in the 
inversions needed for the force computation. This causes reduced iteration 



23 



Algorithm 3 MD update 
1: H = n' = Q 

2: for Z = to A^monomials do 

3: Ti += monomial heat-bath-function 
4: end for 

5: integrate (A^ts, t) 

6: for 2 = to A^monomials do 

7: Ti' += monomial [i] ^acceptance-function 
8: end for 

9: accept with probability min{l, exp(— A7i)} 

numbers needed for the integration of one trajectory. The relevant input pa- 
rameter is ForcePrecision available for each monomial. The precision needed 
in the acceptance and/or heatbath step can be adjusted separately using 
AcceptancePrecision. It is advisable to have the acceptance precision al- 
ways close to machine precision. 

The second possibility for influencing the Dirac operator is given by the re- 
duced precision Dirac operator described in sub-section 14. ![ which is switched 
on with the UseSloppyPrecision input parameter. The two possibilities can 
also be used in parallel. 

Note that one should always test for reversibility violations as explained in 
sub- sect ion 14.31 

4-2.2 Chronological Solver 

The idea of the chronological solver method, or chronological solver guess 
(CSG) (or similar methods 00]) is to optimise the initial guess for the solu- 
tion used in the solver. To this end the history of iVcsc last solutions of the 
equation M^x = is saved and then a linear combination of the flelds Xi with 
coeflicients Cj is used as an initial guess for the next inversion. M stands for 
the operator to be inverted and has to be replaced by the different ratios of 
operators used in this paper. 

The coefficients q are determined by solving 

Y,x]M\,Ci = x](t> (25) 

i 

with respect to the coefficients q. This is equivalent to minimising the func- 
tional that is minimised by the CG inverter itself. 

The downside of this method is that the reversibility violations increase signif- 
icantly by one or two orders of magnitude in the Hamiltonian when the CSG 
is switched on and all other parameters are kept fixed. Therefore one has to 



24 



adjust the residues in the solvers, which increases the number of matrix vec- 
tor multiphcations again. Our experience is that the methods described in the 
previous sub-section are more effective in particular in the context of multiple 
time scale integration, because the CSG is most effective for small values of 
At. 

The input parameters is the CSGHistory parameter available for the relevant 
monomials. Setting it to zero means no chronological solver, otherwise this 
parameter specifies the number of last solutions iVcsc to be saved. 

4-3 Online Measurements 

The HMC program includes the possibility to perform a certain number of 
measurements after every trajectory online, whether or not the configuration 
is stored on disk. Some of those are performed per default, namely all that are 
written to the output file output. data: 

(1) the plaquette expectation value, defined as: 

where V is the global lattice volume. 

(2) the rectangle expectation value, defined as: 

(^) = I^ E ReTr(^ifj 

(3) An = l-{! -n and exp(-A7^). 

See the overview section for details about the output . data file. These observ- 
ables all come with no extra computational cost. 

Optionally, other online measurements can be performed, which - however - 
need in general extra inversions of the Dirac operator. First of all the compu- 
tation of certain correlation functions is implemented. They need one extra 
inversion of the Dirac operator, as discussed in Ref. [H], using the one-end- 
trick. Define a stochastic source ^ as follows 

lim[Ce,]=5.„ lim[e.e,] = 0. (26) 

it— >oo R^oo 

Here R labels the number of samples and i all other degrees of freedom. Then 

[0r0,1ii = Mr^'* . + noise , (27) 
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if was computed from 

0,^ = Mr^'a ■ 

Having in mind the 75-hermiticity property of the Wilson and Wilson twisted 
mass Dirac propagator Gu,d, i-e. 

Guix,y) = 75G'd(y,a;)^75 

it is clear that eq. flTTI) can be used to evaluate 

= (Tr[G„(0,t)75G'd(t,0)75]) = (Tr[G„(0, t)G„(0, t)^]) 

with only one inversion. But, even if the one gamma structure at the source 
is fixed to be 75 due to the 75-hermiticity trick, we are still free to insert any 
7-structure r at the source, i.e. we can evaluate any correlation function of 
the form 

Cpr(t) = (Tr[G„(0,t)75G,(t,0)r]) = (Tr[G„(0, t)G„(0, t)t75r]) . 
Useful combinations of correlation functions are (PP), (PA) and (PV), with 

P" = X75 yX , V° = X7m yX , = X757/.yX 

From (PP) one can extract the pseudo scalar mass, and - in the twisted 
mass case - the pseudo scalar decay constant. {PA) can be used together with 
(PP) to extract the so called PCAC quark mass and (PV) to measure the 
renormalisation constant Zy. For details we refer the reader to Ref. [H]. 

These online measurements are controlled with the two following input param- 
eters: Perf ormOnlineMeasurements to switch them on or off and to specify 
the frequency OnlineMeasurementsFreq. The three correlation functions are 
saved in files named onlinemeas .n, where n is the trajectory number. Every 
file contains five columns, specifying the type, the operator type and the Eu- 
clidean time t. The last two columns are the values of the correlation function 
itself, C{t) and C{—t), respectively. The type is equal to 1, 2 or 6 for the (PP), 
the (PA) and the (PV) correlation functions. The operator type is for online 
measurements always equal to 1 for local source and sink (no smearing of any 
kind), and the time runs from to T/2. Hence, C{—t) = C{T — t). C(— 0) and 
C(— T/2) are set to zero for convenience. 

In addition to correlation functions also the minimal and the maximal eigen- 
values of the (75-D)^ can be measured. 

An online measurement not related to physics, but related to the algorithm 
are checks of reversibility violations. The HMC algorithm is exact if and only 
if the integration scheme is reversible. On a computer with finite precision this 
is only guaranteed up to machine precision. These violations can be estimated 
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by integrating one trajectory forward and then backward in Monte Carlo time. 
The difference 5/S.l-i among the original Hamiltonian Ti and the final one Ti" 
after integrating back can serve as one measure for those violations, another 
one is provided by the difference among the original gauge field U and the 
final one U" 

where we indicate with the 6A that this is obtained after integrating a tra- 
jectory forward and backward in time. The results for 6AT-C and SAU are 
stored in the file return_check.data. The relevant input parameters are 
ReversibilityCheck and ReversibilityChecklnterval. 

4-4 Iterative Solver and Eigensolver 

There are several iterative solvers implemented in the tmLQCD package for 
solving 

for X- The minimal residual (MR), the conjugate gradient (CG), the con- 
jugate gradient squared (CGS), the generalised minimal residual (GMRES), 
the generalised conjugate residual and the stabilised bi-conjugate gradient 
(BiCGstab). For details regarding these algorithms we refer to Refs. [331H2] . 

For the hmc_tm executable only the CG and the BiCGstab solvers are available, 
while all the others can be used in the invert executables. Most of them are 
both available with and without even/odd preconditioning. For a performance 
comparison we refer to Ref. [I3ll36] . 

The stopping criterion is implemented in two ways: the first is an absolute 
stopping criterion, i.e. the solver is stopped when the squared norm of the 
residual vector (depending on the solver this might be the iterated residual or 
the real residual) fulfils 

II Il2 ^ 2 

Ir II < ^ • 

The second is relative to the source vector, i.e. 

||r||^ 2 

w^' ■ 

The value of and the choice of relative or absolute precision can be influenced 
via input parameters. 

The reduced precision Dirac operator, as discussed in sub-section 14.11 is avail- 
able for the CG solver. In the CG solver the full precision Dirac operator is 
only required at the beginning of the CG search, because the relative size of 
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the contribution to the resulting vector decreases with the number of itera- 
tions. Thus, as soon as a certain precision is achieved in the CG algorithm 
we can switch to the reduced precision Dirac operator without spoiling the 
precision of the final result. We switch to the lower precision operator at a 
precision of \fe in the CG search, when aiming for a final precision of e < 1. 

We note that in principle any combination of using reduced precision in one 
of the ways described in this paper is possible. However, one should always 
check that the true residual is as small as expected in case of an inversion and 
that the reversibility violations are small in case of a HMC simulation. 

The eigensolver used to compute the eigenvalues (and vectors) of (75-D)^ is the 
so called Jacobi-Davidson method [iHHS] . For a discussion for the application 
of this algorithm to lattice QCD we refer again to Ref. P3|l36] . 

All solver related files can be found in the sub-directory solver. Note that 
there are a few more solvers implemented which are, however, in an experi- 
mental status. 



^.5 Stout Smearing 



Smearing techniques have become an important tool to reduce ultraviolet fluc- 
tuations in the gauge fields. One of those techniques, coming with the advan- 
tage of being usable in the MD update, is usually called stout smearing [46j . 

The (n -|- 1)*^ level of stout smeared gauge links is obtained iteratively from 
the Vb^ level by 




We refer to the unsmeared ("thin") gauge field as U ^ = Vf^. The SU(3) 
matrices are defined via the staples C^: 



Tr 



Uf\x)C^f{x)-\..^. 
= E Pm. {u^''\x)Uj,-\x + uM-'>\x + A) 

f/(")\x - z>)f7j")(x - 0)Ujr\x - z> + A)) , 



where in general p^i, is the smearing matrix. In the tmLQCD package we have 
only implemented isotropic 4-dimensional smearing, i.e., p^^, = p. 



Currently stout smearing is only implemented for the invert executables. I.e. 
the gauge field can be stout smeared at the beginning of an inversion. The 
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input parameters are UseStoutSmearing, StoutRho and StoutNoIterations. 

4.6 Random Number Generator 

The random number generator used in the code is the one proposed by Martin 
Liischer and usually known under the name RANLUX [57] . A single and double 
precision implementation was made available by the author under the GNU 
General Public License and can be downloaded |1S]. For convenience it is also 
included in the tmLQCD package. 

4-7 10 Formats 

In this final subsection we specify the 10 formats used to store gauge config- 
urations, propagators and sources to disk. 

4-7.1 Gauge Gonfigurations 

For gauge configurations we use the International Lattice Data Grid (ILDG) 
standard as specified in Ref. P^fSD] . As the lime packaging library [JT] and 
ILDG standard allow additional - not required - records to be stored within 
the file, we currently add the following two records for convenience: 

(1) xlf-inf o: useful information about the gauge configuration, such as the 
plaquette value, and about the run and the algorithm and the code version 
used to generate it. 

(2) scidac-checksum: SCIDAC checksum of the gauge configuration. For the 
specification see 

The gauge configurations can be written to disk either in single or double 
precision. The relevant input parameter is GaugeConf igWritePrecision. On 
readin the precision is determined automatically. 

Note that the gauge configuration does not depend on the particular choice 
of the 7-matrices. 

4-7.2 Propagators 

We note at the beginning, that we do not use different 10 formats for source 
or sink fermion fields. They are both stored using the same lime records. 
The meta-data stored in the same lime-packed file is supposed to clarify all 
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other things. It is also important to reahse that the propagator depends on 
the 7-matrix convention used in the Dirac operator. For our convention see 
appendix \^ 

Here we mainly concentrate on storing propagators (sink). The file can contain 
only sources, or both, source and sink. We (plan to) support four different 
formats 

(1) (arbitrary number of) sink, no sources 

(2) (arbitrary number of) source/sink pairs 

(3) one source, 12 sink 

(4) one source, 4 sink 

This is very similar to the formats in use in parts of the US lattice community. 
We adopt the SCIDAC checksum |52] for the binary data. 

Source and sink binary data has to be in a separate lime record. The order in 
one file for the four formats mentioned above is supposed to be 

(1) sink, no sources: - 

(2) source/sink pairs: first source, then sink 

(3) one source, 12 sink: first source, then 12 sinks 

(4) one source, 4 sink: first source, then 4 sinks 

All fermion field files must have a record indicating its type. The record itself 
is of type propagator-type and the record has a single entry (ASCII string) 
which contains one of 

• DiracFermion_Sink 

• DiracFermion_Source_Sink_Pairs 

• DiracFermion_ScalarSource_TwelveSink 

• DiracFermion_ScalarSource_FourSink 

Those strings are also used in the input files for the input parameter PropagatorType. 
The binary data corresponding to one Dirac fermion field (source or sink) is 
then stored with at least two (three) records. The first is of type 
etmc-propagator-f ormat 
and contains the following information: 

<?xnil version="1.0" encoding="UTF-8"?> 
<etnicFormat> 

<f ield>diracFermion</f ield> 

<precision>32</precision> 

<f lavours>l</f lavours> 

<lx>4</lx> 

<ly>4</ly> 
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<lz>4</lz> 
<lt>4</lt> 
</etmcFormat> 

The flavours entry must be set to 1 for a one flavour propagator (flavour diag- 
onal case) and to 2 for a two flavour propagator (flavour non-diagonal 2-flavoiu^ 
operator). In the former case there follows one record of type scidac-binary-data, 
which is identical to the SCIDAC format, containing the fermion field. In the 
latter case there follow two of such records, the first of which is the upper 
flavour. To be precise, lets call the two flavours s (strange) and c (charm). 
Then we always store the s component first and then the c component. 

The first two types are by now supported in the tmLQCD package. In the 
future the other two might follow. 

The indices (time, space, spin, colour) in the binary data scidac-binary-data 
are in the following order: 

t, z, y, X, s, c , 

where t is the slowest and colour the fastest running index. The binary data 
is stored big endian and either in single or in double precision, depending on 
the precision entry in the etmc-propagator-f ormat record. 

In addition we store an additional record called inverter- info with useful 
information about the inversion precision, the physical parameters and the 
code version. 

4.7.3 Source Fields 

Source fields mentioned before, stored with the same binary data for- 

mat. There are again several types of source files possible: 

• DiracFermion_Source 

• DiracFermion_ScalarSource 

• DiracFermion_FourScalarSource 

• DiracFermion_TwelveScalarSource 

This type is stored in a record called source-type in the lime file. There 
might be several sources stored within the same file. We add a format record 
etmc-source-f ormat looking like 

<?xinl version="1.0" encoding="UTF-8"?> 
<etmcFormat> 

<f ield>diracFermion</ f ield> 

<precision>32</precision> 

<f lavours>l</f lavours> 
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<lx>4</lx> 
<ly>4</ly> 
<lz>4</lz> 
<lt>4</lt> 
<spin>4</ spin> 
<colour>3</ colour> 
</etmcFormat> 

with obvious meaning for every scidac-binary-data record within the hme 
packed file. This format record also allows to store a subset of the whole field, 
e.g. a time-shce. 



5 Installation instructions 

The software ships with a GNU autoconf environment and a configure script, 
which will generate GNU Makefiles to build the programmes. It is supported 
and recommended to configure and build the executables in a separate build 
directory. This also allows to have several builds with different options from 
the same source code directory. 

5. 1 Prerequisites 

In order to compile the programmes the LAPACK [53j library (Fortran ver- 
sion) needs to be installed. In addition it must be known which linker op- 
tions are needed to link against LAPACK, e.g. -Lpath-to-lapack -llapack 
-Iblas. Also a the latest version (tested is version 1.2.3) of C-LIME [51] must 
be available, which is used as a packaging scheme to read and write gauge 
configurations and propagators to files. 

5.2 Configuring the tmLQCD package 

In order to get a simple configuration of the hmc package it is enough to just 
type 

patch-to-src-code/conf igure — with-lime=<path-to-lime> \ 
— with-lapack=<linker-f lags> CC=<mycc> \ 
F77=<myf77> CFLAGS=<c-compiler flags> 

in the build directory. If CC, F77 and CFLAGS are not specified, configure 
will guess them. 
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The code was successfully compiled and run at least on the following platforms: 
1686 and compatible, x64 and compatible, IBM Regatta systems, IBM Blue 
Gene/L, IBM Blue Gene/P, SGI Altix and SGI PC clusters and powerpc 
clusters. 

The configure script accepts certain options to influence the building proce- 
dure. One can get an overview over all supported options with configure 
— help. There are enable I disable options switching on and off optional fea- 
tures and with I without switches usually related to optional packages. In the 
following we describe the most important of them (check configure — help 
for the defaults and more options): 

• — enable-mpi: 

This option switches on the support for MPI. On certain platforms it auto- 
matically chooses the correct parallel compiler or searches for a command 
mpicc in the search path. 

• — enable-p4: 

Enable the use of special Pentium4 instruction set and cache management. 

• — enable-opteron: 

Enable the use of special opteron instruction set and cache management. 

• — enable-sse2: 

Enable the use of SSE2 instruction set. This is a huge improvement on 
Pentium4 and equivalent systems. 

• — enable-sse3: 

Enable the use of SSE3 instruction set. This will give another 20% of 
speedup when compared to only SSE2. However, only a few processors are 
capable of SSE3. 

• — enable-gaugecopy: 

See section 14.11 for details on this option. It will increase the memory re- 
quirement of the code. 

• — enable-half spinor: 

If this option is enabled the Dirac operator using half spinor fields is used. 
See sub-section 14.11 for details. If this feature is switched on, also the gauge 
copy feature is switched on automatically. 

• — with-mpidimension=n: 

This option has only effect if the code is configured for MPI usage. The 
number of parallel directions can be specified. 1,2,3 and 4 dimensional par- 
allehsation is supported. 

• — with-lapack="<linker flags>": 

the code requires lapack to be linked. All linker flags necessary to do so 
must be specified here. Note that L1BS=" ..." works similar. 

• — with-limedir=<dir>: 

Tells configure where to find the lime package, which is required for the 
build of the HMC. It is used for the ILDG file format. 
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i±io 


1 rti 


i 1X2 


input-file 


sample- hmcO . input 


sample- hmc2. input 


sample-hmc3 .input 


r 3 w 
L X i 


/1 3 V/ /I 
4X4 


4X4 


4X4 


CI 


Wilson 


lioym 


Iwasaki 


a 
P 


o.u 


O.O 


1 Qc; 
i.yo 


K 


U.i77 


n 1 7 
0.17 


O.ibozoO 




n 1 77 


U.Ui 


U.UUZ / 4UyDl 






U.iiUO 




2Ke 




0.0935 




{P) 


0.62457(7) 


0.53347(17) 


0.5951(2) 


(R) 




0.30393(22) 


0.3637(3) 


Table 1 

Parameter and results for three sample input files as provided with the code. 



The configure script will guess at the very beginning on which platform the 
build is done. In case this fails or a cross compilation must be performed please 
use the option — host=HOST. For instance in order to compile for the BG/P 
one needs to specify — host=ppc-ibm-bprts — build=ppc64-ibin-linux. 



For certain architectures like the Blue Gene systems there are README. arch 
files in the top source directory with example configure calls. 

5.3 Building and Installing 

After successfully configuring the package the code can be build by simply 
typing make in the build directory. This will compile the standard executables. 
Typing make install will copy these executables into the install directory. 
The default install directory is $HOME/bin, which can be infiuenced e.g. with 
the — prefix option to configure. 



6 Test run description 

The source code ships with a number of sample input files. They are located 
in the sample-input sub-directory. They are small volume V — A'^ test runs 
designated to measure for instance the average plaquette values. 

Such a test-run can be performed for instance on a scalar machine by typing 



34 



. /hmc_tm -f sample-hmcO . input . 

Depending on the environment you are running in, you may need to adjust 
the input parameters to match the maximal run-time and so on. The expected 
average plaquette values are quoted in table [1] and also in the sample input 
files. 



6.1 Benchmark Executable 



Another useful test executable is a benchmark code. It can be build by typing 
make benchmark and it will, when run, measure the performance of the Dirac 
operator. It can be run in the serial or parallel case. It reads its input from a 
file benchmark . input and the relevant input parameters are the following: 

L = 4 
T = 4 

NrXProcs = 2 
NrYProcs = 2 
NrZProcs = 2 
UseEvenOdd = yes 
UseSloppyPrecision = no 

In case of even/odd preconditioning the performance of the hopping matrix is 
evaluated, in case of no even/odd the performance of the Dirac operator. The 
important part of the output of the code is as follows 

[. . .] 

(1429 Mflops [64 bit arithmetic]) 

communication switched off 
(2592 Mflops [64 bit arithmetic]) 

The size of the package is 36864 Byte 
The bandwidth is 662.91 + 662.91 MB/sec 

The bandwidth is not measured directly but computed from the performance 
difference among with and without communication and the package size. In 
case of a serial run the output is obviously reduced. 
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A 7 and Pauli Matrices 



In the following we specify our conventions for 7- and Pauli-matrices. 



A.l 'y -matrices 



We use the following convention for the Dirac 7-matrices: 



^ 0-10^ 
-1 



7o = 



-10 
0-100 



7i = 



A 



72 = 



^ 0-1 
00+10 
0+100 

y-l 



73 = 





-i 

+i 

\+i y 

^ -i ^ 

+z 

+i 

-i 



In this representation 75 is diagonal and reads 

Vl ^ 
0+100 

75 = 

0-10 
-ly 



A. 2 Pauli-matrices 



For the Pauli-matrices acting in flavour space we use the following convention: 



1 
1 



1 

1 0. 



-i 

1 



1 
-1 
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B Even/Odd Preconditioning 



B.l HMC Update 



In this section we describe liow even/odd [30|3T ] preconditioning can be used 
in the HMC algorithm in presence of a twisted mass term. Even/odd precon- 
ditioning is implemented in the tmLQCD package in the HMC algorithm as 
well as in the inversion of the Dirac operator, and can be used optionally. 

We start with the lattice fermion action in the hopping parameter represen- 
tation in the x-basis written as 



x] 



U{x, fi){r + j^)x{x + afi) 



(B.l) 



+ U\x — ajl, fi){r — 7^)x(x — a/t) 
J2x{x)M^yX{y) 



x,y 



similar to eq. (jlj). For convenience we define fi = 2K/i. Using the matrix M 
one can define the hermitian (two flavour) operator: 



Q = 75M 



Q- 



Q- 



(B.2) 



where the sub- matrices can be factorised as follows (Schur decomposition): 



1 ± i/i75 Me, 



75 



\Moe 



Moe 1 ± «yU75^ 



(B.3) 



Note that (M^^)'^ can be computed to be 



^1 1 =F W5 



1 + 



(B.4) 
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Using det{Q) — det{Q'^) det{Q ) the following relation can be derived 

det(g±)ocdet(g±) 

where is only defined on the odd sites of the lattice. In the HMC algorithm 
the determinant is stochastically estimated using pseudo fermion fields (po'- 

det(g+g-) = / Vcj)oVcj)l eM-SpF) 

J _^ (B.6) 

5pp= 0t (Q+Q-) 

where the fields 0o arc defined only on the odd sites of the lattice. In order 
to compute the force corresponding to the effective action S-pY we need the 
variation of -S'pf with respect to the gauge fields (using 5{A~^) — —A'^^AA'^y. 

SSpY = -[0l(Q+g-)-'<5Q+(Q+)-Vo + 0l(Q-)-'<^Q-(Q+Q-)-Vo] 
^-[XlSQ+Yo + Y^5Q-Xo\ 

with Xg and Yo defined on the odd sides as 

Xo = (Q+Q-)-Vo, n = (Q+)-Vo = Q-Xo , (B.8) 

where {Q'^Y = has been used. The variation of reads 

SQ^ = 75 {-5Moe{Mi)-^M,o - Moe{M^e)''SMeo) , (B.9) 

and one finds 



dSpF = -{X^5Q+Y + r^^Q-X) 
where X and Y are now defined over the full lattice as 



(B.IO) 



^^,-(M-)-M.X.\ ^J-(M.t)-M.n, ^^^^^ 

In addition 6Q~^ = SQ~^ , Mj^ = 'y^MoeJb and Mjg = 75Meo75 have been used. 
Since the bosonic part is quadratic in the (f)o fields, the 0o are generated at 
the beginning of each molecular dynamics trajectory with 

00 = Q-^To, (B.12) 

where Vg is a random spinor field taken from a Gaussian distribution with 
norm one. 
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B.1.1 Mass non-degenerate flavour doublet 

Even/odd preconditioning can also be implemented for the mass non-degenerate 
flavour doublet Dirac operator Dh eq. ([5]). Denoting 

the even/odd decomposition is as follows 



(75 + ifiT^ - er^' 



Q 



\Qoe 



Q'oe 
I 



(75 + Zyur^ 



er 



1 meY^Q 

Qt 



(B.13) 



where is given in flavour space by 



Qoo = 75 



1 1 _L ,-r,o, _ M,4l-ip.'y5)Meo 



'fc 1 J- + 1+^2 _,-2 



Moe{l-itJ.j5)Meo 
1+^*2 -e2 



with the previous deflnitions of Mgo etc. The implementation for the PHMC 
is very similar to the mass degenerate HMC case. 



B.2 Inversion 



In addition to even/odd preconditioning in the HMC algorithm as described 
above, it can also be used to speed up the inversion of the fermion matrix. Due 
to the factorisation fIB.SP the full fermion matrix can be inverted by inverting 
the two matrices appearing in the factorisation 













V 



ee, ^M,o 



M±0 

Moe 1 



The two factors can be simplifled as follows: 







M„e 1, 







±\-i 
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and 



-1 



1 (ML 

(M± - M„e(M±)-iM„ 

1 -(M±)-^Meo(M± - MUMte)-'M,o) 
(M± - M,,(M±)-iMe.)-i 



The complete inversion is now performed in two separate steps: first compute 
for a given source field = (0e, <Po) an intermediate result ip = {ipe, ^o) by: 





This step requires only the application of Moe and (M^)~^, the latter of which 
is given by eq. (lB.4p . The final solution ■?/' = {ipeyi^o) can then be computed 
with 



1 

(M± - MoeiM^.)-'M, 




"00 



where we defined 



Therefore, the only inversion that has to be performed numerically is the one 
to generate ipo from ipo and this inversion involves only an operator that is 
better conditioned than the original fermion operator. 

Even/odd preconditioning can also be used for the mass non-degenerate Dirac 
operator eq. ([5]). The corresponding equations follow immediately from the 
previous discussion and the definition from eq. fIB.lSp . 



C Initialising the PHMC 



The function 1 / ^/s in the interval [e, 1] can be approximated using polyno- 
mials or rational functions of different sorts. In the tmLQCD package we use 
Chebysheff polynomials, which are easy to construct. They can be constructed 
as to provide a desired overall precision in the interval [e, 1]. 

As discussed in sub-section 12. 3[ the roots of the polynomial Pn^e are needed 
for the evaluation of the force. Even though the roots come in complex conju- 
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gate pairs, for our case the roots cannot be computed analytically, hence we 
need to determine them numerically. Such an evaluation requires usually high 
precision. This is why these roots need to be determined before a PHMC run 
using an external program, i.e. they cannot be computed at the beginning of 
a run in the hnic_tm program. 

Such an external program ships with the tmLQCD code, which is located in 
the util/laguere directory ^ I. It is based on Laguerre's method and uses the 
Class Library for Numbers (CLN) [51], which provides arbitrary precision data 
types. In order to compute roots the CLN library must be available, which is 
free software. 

Taking for granted that the CLN library is available, the procedure for com- 
puting the roots is as follows: assuming the non-degenerate Dirac operator 
has eigenvalues in the interval [smin,5max], i-e. e = Smin/smax, and the poly- 
nomial degree is n. Edit the file chebyRoot.H and set the variable EPSILON 
to the value of e. Moreover, set the variable MAXPOW to the degree n. Adapt 
the Makefile to your local installation and compile the code by typing make. 
After running the ChebyRoot program successfully, you should find two files 
in the directory 

(1) Square_root_BR_roots . dat: 

which contains the roots of the polynomial in bit-reverse order [2^ . 

(2) normierungLocal.dat: 

which contains a normalisation constant. 

Copy these two files into the directory where you run the code and adjust the 
input parameters to match exactly the values used for the root computation. 
I.e. the input parameters StildeMin, StildeMax and DegreeOf MDPolynomial 
must be set appropriately in the NDPOLY monomial. 

The minimal and maximal eigenvalue of the non-degenerate flavour doublet 
can be computed as an online measurement. The frequency can be specified 
in the NDPOLY monomial with the input parameter ComputeEVFreq and they 
are written to the file called phmc.data. Note that this is not a cheap oper- 
ation in terms of computer time. However, if the approximation interval of 
the polynomial is chosen wrongly the algorithm performance might deterio- 
rate drastically, in particular if the upper bound is set wrongly. It is therefore 
advisable to introduce some security measure in particular in the value of Smax- 



^ We thank Istvan Montvay for providing us with his code. 
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